inference system
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.87)
- Health & Medicine (0.93)
- Information Technology (0.67)
- Education > Educational Setting (0.67)
FairBatching: Fairness-Aware Batch Formation for LLM Inference
Lyu, Hongtao, Liu, Boyue, Wu, Mingyu, Chen, Haibo
Large language model (LLM) inference systems face a fundamental tension between minimizing Time-to-First-Token (TTFT) latency for new requests and maintaining a high, steady token generation rate (low Time-Per-Output-Token, or TPOT) for ongoing requests. Existing stall-free batching schedulers proposed by Sarathi, while effective at preventing decode stalls, introduce significant computational unfairness. They prioritize decode tasks excessively, simultaneously leading to underutilized decode slack and unnecessary prefill queuing delays, which collectively degrade the system's overall quality of service (QoS). This work identifies the root cause of this unfairness: the non-monotonic nature of Time-Between-Tokens (TBT) as a scheduling metric and the rigid decode-prioritizing policy that fails to adapt to dynamic workload bursts. We therefore propose FairBatching, a novel LLM inference scheduler that enforces fair resource allocation between prefill and decode tasks. It features an adaptive batch capacity determination mechanism, which dynamically adjusts the computational budget to improve the GPU utilization without triggering SLO violations. Its fair and dynamic batch formation algorithm breaks away from the decode-prioritizing paradigm, allowing computation resources to be reclaimed from bursting decode tasks to serve prefill surges, achieving global fairness. Furthermore, FairBatching provides a novel load estimation method, enabling more effective coordination with upper-level schedulers. Implemented and evaluated on realistic traces, FairBatching significantly reduces TTFT tail latency by up to 2.29x while robustly maintaining TPOT SLOs, achieving overall 20.0% improvement in single-node capacity and 54.3% improvement in cluster-level capacity.
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- North America > United States > California > San Diego County > Carlsbad (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- Energy (1.00)
- Information Technology (0.68)
- North America > United States > Iowa (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Health & Medicine (0.93)
- Information Technology (0.67)
- Education > Educational Setting (0.67)
Situation Model of the Transport, Transport Emissions and Meteorological Conditions
Benes, V., Svitek, M., Michalikova, A., Melicherik, M.
Air pollution in cities and the possibilities of reducing this pollution represents one of the most important factors that today's society has to deal with. This paper focuses on a systemic approach to traffic emissions with their relation to meteorological conditions, analyzing the effect of weather on the quantity and dispersion of traffic emissions in a city. Using fuzzy inference systems (FIS) the model for prediction of changes in emissions depending on various conditions is developed. The proposed model is based on traffic, meteorology and emission data measured in Prague, Czech Republic. The main objective of the work is to provide insight into how urban planners and policymakers can plan and manage urban transport more effectively with environmental protection in mind.
- Europe > Czechia > Prague (0.29)
- Europe > Slovakia > Banska Bystrica > Banská Bystrica (0.04)
- Transportation (0.89)
- Government (0.67)
A Fuzzy Approach to Project Success: Measuring What Matters
Granja-Correia, João, Hernández-Linares, Remedios, Ferranti, Luca, Rego, Arménio
This paper introduces a novel approach to project success evaluation by integrating fuzzy logic into an existing construct. Traditional Likert-scale measures often overlook the context-dependent and multifaceted nature of project success. The proposed hierarchical Type-1 Mamdani fuzzy system prioritizes sustained positive impact for end-users, reducing emphasis on secondary outcomes like stakeholder satisfaction and internal project success. This dynamic approach may provide a more accurate measure of project success and could be adaptable to complex evaluations. Future research will focus on empirical testing and broader applications of fuzzy logic in social science.
On Evaluating Performance of LLM Inference Serving Systems
Agrawal, Amey, Kedia, Nitin, Agarwal, Anmol, Mohan, Jayashree, Kwatra, Nipun, Kundu, Souvik, Ramjee, Ramachandran, Tumanov, Alexey
The rapid evolution of Large Language Model (LLM) inference systems has yielded significant efficiency improvements. However, our systematic analysis reveals that current evaluation methodologies frequently exhibit fundamental flaws, often manifesting as common evaluation anti-patterns that obscure true performance characteristics and impede scientific progress. Through a comprehensive examination of recent systems, we identify recurring anti-patterns across three key dimensions: Baseline Fairness, Evaluation Setup, and Metric Design. These anti-patterns are uniquely problematic for LLM inference due to its dual-phase nature combining distinct prefill and decode operations, its handling of highly heterogeneous workloads, and its strict temporal requirements for interactive use. We demonstrate how common anti-patterns -- such as inadequate baseline comparisons that conflate engineering effort with algorithmic novelty, workload selections that fail to represent production scenarios, and metric normalizations that hide substantial performance variability like generation stalls-lead to misleading conclusions. To address these challenges, we provide a comprehensive checklist derived from our analysis, establishing a framework for recognizing and avoiding these anti-patterns in favor of robust LLM inference evaluation. To demonstrate the practical application of our framework, we present a case study analyzing speculative decoding, a technique whose bursty, non-uniform token generation is easily misinterpreted when evaluated using approaches characteristic of these anti-patterns. Our work establishes a rigorous foundation for evaluation methodology, enabling meaningful comparisons, ensuring reproducible results, and ultimately accelerating genuine progress in LLM inference systems by moving beyond common anti-patterns to align evaluation with real-world requirements.
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- Asia > Singapore (0.04)
- Asia > Indonesia > Bali (0.04)
Orthogonal Factor-Based Biclustering Algorithm (BCBOF) for High-Dimensional Data and Its Application in Stock Trend Prediction
Biclustering is an effective technique in data mining and pattern recognition. Biclustering algorithms based on traditional clustering face two fundamental limitations when processing high-dimensional data: (1) The distance concentration phenomenon in high-dimensional spaces leads to data sparsity, rendering similarity measures ineffective; (2) Mainstream linear dimensionality reduction methods disrupt critical local structural patterns. To apply biclustering to high-dimensional datasets, we propose an orthogonal factor-based bicluster-ing algorithm (BCBOF). First, we constructed orthogonal factors in the vector space of the high-dimensional dataset. Then, we performed clustering using the coordinates of the original data in the orthogonal subspace as clustering targets. Finally, we obtained biclustering results of the original dataset. Since dimensionality reduction was applied before clustering, the proposed algorithm effectively mitigated the data sparsity problem caused by high dimensionality. Additionally, we applied this biclustering algorithm to stock technical indicator combinations and stock price trend prediction. Biclustering results were transformed into fuzzy rules, and we incorporated profit-preserving and stop-loss rules into the rule set, ultimately forming a fuzzy inference system for stock price trend predictions and trading signals. The results showed that our algorithm outperformed other biclustering techniques. To validate the effectiveness of the fuzzy inference system, we conducted virtual trading experiments using historical data from 10 A-share stocks. The experimental results showed that the generated trading strategies yielded higher returns for investors. Introduction Since its initial proposal by Cheng and Church[1], biclustering has evolved into a sophisticated analytical approach.
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- North America > United States > New York (0.04)
- (4 more...)
Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices
Ye, Shengyuan, Ouyang, Bei, Zeng, Liekang, Qian, Tianyi, Chu, Xiaowen, Tang, Jian, Chen, Xu
--Generative large language models (LLMs) have garnered significant attention due to their exceptional capabilities in various AI tasks. Traditionally deployed in cloud datacenters, LLMs are now increasingly moving towards more accessible edge platforms to protect sensitive user data and ensure privacy preservation. The limited computational resources of individual edge devices, however, can result in excessively prolonged inference latency and overwhelmed memory usage. While existing research has explored collaborative edge computing to break the resource wall of individual devices, these solutions yet suffer from massive communication overhead and under-utilization of edge resources. Furthermore, they focus exclusively on optimizing the prefill phase, neglecting the crucial autoregressive decoding phase for generative LLMs. T o address that, we propose Jupiter, a fast, scalable, and resource-efficient collaborative edge AI system for generative LLM inference. Jupiter introduces a flexible pipelined architecture as a principle and differentiates its system design according to the differentiated characteristics of the prefill and decoding phases. For prefill phase, Jupiter submits a novel intra-sequence pipeline parallelism and develops a meticulous parallelism planning strategy to maximize resource efficiency; For decoding, Jupiter devises an effective outline-based pipeline parallel decoding mechanism combined with speculative decoding, which further magnifies inference acceleration. Extensive evaluation based on realistic implementation demonstrates that Jupiter remarkably outperforms state-of-the-art approaches under various edge environment setups, achieving up to 26. 1 end-to-end latency reduction while rendering on-par generation quality. I NTRODUCTION The emergence of generative large language models (LLMs) has attracted widespread attention from both industry and academia owing to their exceptional capabilities in a wide range of artificial intelligence (AI) tasks. These models, widely deployed in cloud datacenters equipped with powerful server-grade GPUs, have driven increasing intelligent edge applications such as ChatBot [1] and smart-home AI agent [2].
- North America > United States (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Asia > China > Hong Kong (0.04)
- (2 more...)
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
Enhanced Sentiment Analysis of Iranian Restaurant Reviews Utilizing Sentiment Intensity Analyzer & Fuzzy Logic
Rokhva, Shayan, Teimourpour, Babak, Babaei, Romina
This research presents an advanced sentiment analysis framework studied on Iranian restaurant reviews, combining fuzzy logic with conventional sentiment analysis techniques to assess both sentiment polarity and intensity. A dataset of 1266 reviews, alongside corresponding star ratings, was compiled and preprocessed for analysis. Initial sentiment analysis was conducted using the Sentiment Intensity Analyzer (VADER), a rule-based tool that assigns sentiment scores across positive, negative, and neutral categories. However, a noticeable bias toward neutrality often led to an inaccurate representation of sentiment intensity. To mitigate this issue, based on a fuzzy perspective, two refinement techniques were introduced, applying square-root and fourth-root transformations to amplify positive and negative sentiment scores while maintaining neutrality. This led to three distinct methodologies: Approach 1, utilizing unaltered VADER scores; Approach 2, modifying sentiment values using the square root; and Approach 3, applying the fourth root for further refinement. A Fuzzy Inference System incorporating comprehensive fuzzy rules was then developed to process these refined scores and generate a single, continuous sentiment value for each review based on each approach. Comparative analysis, including human supervision and alignment with customer star ratings, revealed that the refined approaches significantly improved sentiment analysis by reducing neutrality bias and better capturing sentiment intensity. Despite these advancements, minor over-amplification and persistent neutrality in domain-specific cases were identified, leading us to propose several future studies to tackle these occasional barriers. The study's methodology and outcomes offer valuable insights for businesses seeking a more precise understanding of consumer sentiment, enhancing sentiment analysis across various industries.
- Europe > Hungary > Budapest > Budapest (0.04)
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
- Research Report > New Finding (1.00)
- Overview (0.93)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Collaborative Speculative Inference for Efficient LLM Inference Serving
Gao, Luyao, Liu, Jianchun, Xu, Hongli, Huang, Liusheng
Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language model (LLM). This approach enhances the efficiency of inference serving by reducing LLM inference latency and costs while preserving generation quality. However, existing speculative methods face critical challenges, including inefficient resource utilization and limited draft acceptance, which constrain their scalability and overall effectiveness. To overcome these obstacles, we present CoSine, a novel speculative inference system that decouples sequential speculative decoding from parallel verification, enabling efficient collaboration among multiple nodes. Specifically, CoSine routes inference requests to specialized drafters based on their expertise and incorporates a confidence-based token fusion mechanism to synthesize outputs from cooperating drafters, ensuring high-quality draft generation. Additionally, CoSine dynamically orchestrates the execution of speculative decoding and verification in a pipelined manner, employing batch scheduling to selectively group requests and adaptive speculation control to minimize idle periods. By optimizing parallel workflows through heterogeneous node collaboration, CoSine balances draft generation and verification throughput in real-time, thereby maximizing resource utilization. Experimental results demonstrate that CoSine achieves superior performance compared to state-of-the-art speculative approaches. Notably, with equivalent resource costs, CoSine achieves up to a 23.2% decrease in latency and a 32.5% increase in throughput compared to baseline methods.
- North America > United States (0.04)
- Asia > China > Anhui Province > Hefei (0.04)